Analyzing Foot Traffic Data in Philadelphia

Photo by Prasad Panchakshari on Unsplash

Project Overview

Overview of Data Sources

We've combined data from multiple sources:

The addition of each dataset allows us to ask more questions about these POIs, including:

Data Cleaning Methodology

Taking a closer look at SafeGraph's Point-of-Interest (POI) datasets

SafeGraph is an industry-leading data company that brings together highly accurate data from thousands of sources to deliver deep insights on millions of locations where consumers spend their money. This data is updated monthly to account for store openings and closings. SafeGraph offers three categories of data on its POI locations:

  1. Core Places - Basic business information including address, phone number, category, and open/closed status.
  2. Geometry - Building footprints with spatial hierarchy for all POIs in the Core dataset.
  3. Patterns - Foot-traffic insights for places derived from anonymized mobile devices.

"Our customers trust SafeGraph data to be clean and free of unnecessary noise. Our algorithms work over-time to remove and filter out irrelevant POIs and errors in the data. Better data means better decisions for your business." - SafeGraph

Below we look at a monthly data extract containing POI data for Wawa locations in and around the Philadelphia area. We also join the main dataset to an additional file from SafeGraph that details the sample size of devices measured in each census block group (we'll explore this further a little later on). SafeGraph states that it receives a ~10% sample of device data from its providers and does its best to make sure that sample is demographically representative. You can read more about how they tackle bias in their data collection in this blog post.

Census Block Group
A Census Block Group is a geographical unit used by the United States Census Bureau which is between the Census Tract and the Census Block. It is the smallest geographical unit for which the bureau publishes sample data, i.e. data which is only collected from a fraction of all households.

We will also be joining this data with SafeGraph's Open Census Data, a standardized and processed version of the U.S. Census Bureau's American Community Survey, to get the estimated population of each census block group.
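As a rough illustration of the join described above, here is a minimal pandas sketch. The column names (`census_block_group`, `number_devices_residing`, `population`) and all values are assumptions made up for the example, not the exact layout of the SafeGraph files.

```python
import pandas as pd

# Hypothetical device sample sizes per census block group (CBG).
home_panel = pd.DataFrame({
    "census_block_group": ["421010001001", "421010001002"],
    "number_devices_residing": [100, 200],
})

# Hypothetical population estimates per CBG from the census data.
census = pd.DataFrame({
    "census_block_group": ["421010001001", "421010001002"],
    "population": [1000, 1500],
})

# Join on the CBG identifier; validate guards against duplicated CBG rows.
cbg_lookup = home_panel.merge(
    census, on="census_block_group", how="left", validate="one_to_one"
)

# How many residents each sampled device stands in for.
cbg_lookup["people_per_device"] = (
    cbg_lookup["population"] / cbg_lookup["number_devices_residing"]
)
```

Keeping the CBG identifier as a string in both frames avoids losing leading zeros, which would otherwise break the join for some states.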

Creating a function get_sg_data() to repeat this process for any POI and month

Exploratory Data Analysis on a single store

As you can see, there is a wealth of information about each of our store locations. When looking at the Patterns portion of the dataset, it becomes apparent that several columns cannot be processed right away, because they hold data structures encoded as strings. However, with a little elbow grease and Python data manipulation magic, we can uncover some pretty fascinating insights about our POI locations.
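To make the string-encoding problem concrete, here is a small sketch of decoding such columns. It assumes the encoded values are valid JSON; the helper name `decode_patterns_row` and the sample values are hypothetical and are not part of the sg module.

```python
import json

# Columns we assume to be JSON-encoded strings in the Patterns data.
ENCODED_COLUMNS = (
    "visits_by_day",
    "popularity_by_hour",
    "device_type",
    "bucketed_dwell_time",
)


def decode_patterns_row(row: dict) -> dict:
    """Return a copy of the row with string-encoded columns parsed."""
    decoded = dict(row)
    for column in ENCODED_COLUMNS:
        value = decoded.get(column)
        if isinstance(value, str):
            decoded[column] = json.loads(value)
    return decoded


# Made-up record for illustration only.
row = {
    "location_name": "Wawa",
    "visits_by_day": "[4, 7, 5]",
    "device_type": '{"android": 6, "ios": 4}',
}
decoded = decode_patterns_row(row)
```

After decoding, `decoded["visits_by_day"]` is a regular Python list and `decoded["device_type"]` is a dictionary, so both can be passed straight to pandas or a plotting library.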

Creating the sg module

To assist with all of the data manipulation steps needed to dissect the foot traffic pattern columns and massage them into a workable format, we created and packaged several functions under the sg module to process each store location more efficiently. The source code of this module can be found at /src/scripts/sg.py.

Let's take a look at the first store in our dataset and do some basic analysis on some of the main columns:

Analyzing visits_by_day column using the sg.calc_visits_by_day() function

The visits_by_day column of the SafeGraph dataset contains a string-encoded list of the number of visits to the POI location on each day of the month. This list doesn't become useful until we map the values to the actual dates they correspond to, so that's what the sg.calc_visits_by_day() function is designed to do. In a nutshell, it uses the date_range_start and date_range_end columns from the POI record to build a pandas date_range index with a row for every day in the date range. Then it takes the list of daily visits and maps the first value to day 1, the second value to day 2, and so on. The function takes a pandas Series object representing a single store and observation in the dataset and returns a dataframe with one row per day.
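The mapping described above can be sketched roughly as follows. This is not the actual source of sg.calc_visits_by_day(); the function name `visits_by_day_frame` is hypothetical, and the sketch assumes the column is a JSON list and that date_range_end is exclusive (the first day after the period).

```python
import json

import pandas as pd


def visits_by_day_frame(store: pd.Series) -> pd.DataFrame:
    """Map a string-encoded list of daily visits onto calendar dates."""
    daily_visits = json.loads(store["visits_by_day"])
    start = pd.Timestamp(store["date_range_start"])
    end = pd.Timestamp(store["date_range_end"])

    # One entry per day; drop the end date itself (assumed exclusive).
    days = pd.date_range(start, end, freq="D")
    days = days[days < end]

    if len(days) != len(daily_visits):
        raise ValueError("visits_by_day does not match the date range length")

    return pd.DataFrame({"date": days, "visits": daily_visits})


# Made-up three-day record for illustration only.
store = pd.Series({
    "visits_by_day": "[4, 7, 5]",
    "date_range_start": "2020-10-01",
    "date_range_end": "2020-10-04",
})
visits = visits_by_day_frame(store)
```

The length check is a deliberate design choice: if the list and the date range disagree, silently truncating either one would shift visits onto the wrong days.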

Plotting the visits dataframe returned by sg.calc_visits_by_day()

SafeGraph Sampling and Estimating True Visits By Day

As discussed earlier, SafeGraph receives a ~10% sample of the available device data, so the foot traffic numbers reported in the dataset are much lower than the real-life visits. To correct for this sampling when required, we use the visitor_home_cbgs column, which disaggregates the visitors by their home census block group. We then use our supplementary datasets to extrapolate the real number of visits from SafeGraph's reported sample size for each census block group and the actual reported 2016 population of each census block group. We created the sg.calc_true_visits_by_day() function to perform these calculations based on this example.
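One way to picture the extrapolation is the sketch below. It is an illustration under stated assumptions, not the code inside the sg module: we assume visitor_home_cbgs is a JSON object mapping a CBG id to a sampled visitor count, we weight each count by population divided by sampled devices for that CBG, and we then apply the resulting overall ratio to the daily visit counts. The function names and all numbers are hypothetical.

```python
import json


def estimate_true_visitors(visitor_home_cbgs, devices_by_cbg, population_by_cbg):
    """Scale sampled visitors per home CBG up to an estimated real count."""
    sampled_counts = json.loads(visitor_home_cbgs)
    estimate = 0.0
    for cbg, sampled in sampled_counts.items():
        devices = devices_by_cbg.get(cbg)
        population = population_by_cbg.get(cbg)
        if not devices or population is None:
            continue  # no usable sample size or population for this CBG
        estimate += sampled * population / devices
    return estimate


def scale_daily_visits(daily_visits, sampled_total, estimated_total):
    """Apply the overall sampled-to-estimated ratio to each day's visits."""
    factor = estimated_total / sampled_total
    return [visits * factor for visits in daily_visits]


# Made-up inputs for illustration only.
home_cbgs = '{"421010001001": 4, "421010001002": 6}'
devices = {"421010001001": 100, "421010001002": 200}
population = {"421010001001": 1000, "421010001002": 1500}

true_visitors = estimate_true_visitors(home_cbgs, devices, population)
true_daily = scale_daily_visits([4, 7, 5], sampled_total=10, estimated_total=true_visitors)
```

In this toy example the first CBG contributes 4 * 1000 / 100 = 40 estimated visitors and the second 6 * 1500 / 200 = 45, so 10 sampled visitors become an estimated 85, and each daily count is multiplied by 8.5.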

Below is the returned dataframe from calling sg.calc_true_visits_by_day() on the same store Series:

Plotting the popularity_by_hour column

Plotting the device_type column

Plotting the bucketed_dwell_time column

Mapping Store Location vs. Visitor Home Census Block Location

By plotting this visualization, we can see how far people travel to this store from their home location. It makes sense that a lot of visitors (black markers) live near the store (red marker), and it is reasonable to assume that visitors who live farther away stop at this location either on their way to work or on the way to a different part of town.

Given that we only have data from the past few months, it makes sense that there are very few black markers far away from the store, since travel has been reduced due to COVID-19. If we can get access to any pre-COVID data, it will be interesting to see how far the black markers spread throughout the country, as a possible micro-view into the overall tourism trends of Philadelphia.
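To put a number on "how far", one option is the great-circle (haversine) distance between the store and the centroid of each visitor home census block group. The sketch below is an assumption about how this could be measured rather than something taken from the sg module; the coordinates and CBG labels are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical store location and home CBG centroids (lat, lon).
store = (39.9526, -75.1652)
home_cbg_centroids = {
    "cbg_a": (39.9626, -75.1652),
    "cbg_b": (40.9526, -75.1652),
}

distances_km = {
    cbg: haversine_km(*store, *centroid)
    for cbg, centroid in home_cbg_centroids.items()
}
```

Weighting these distances by the visitor count of each census block group would give a typical travel distance per store, which could then be compared across months.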

Analyzing Multiple Stores and Brands

Starbucks vs Wawa Foot Traffic Share: October 2020

Additional Data Sources

Historical Weather Data from NOAA

Google's COVID-19 Community Mobility Reports

Appendix